
Feat/linear multi backend #58

Draft

gongchensu wants to merge 11 commits into InfiniTensor:feat/ascend-operators from gongchensu:feat/linear-multi-backend

Conversation

@gongchensu (Contributor)

No description provided.

@gongchensu gongchensu self-assigned this Apr 16, 2026
@gongchensu (Contributor, Author)

A100 build and operator tests:
[screenshot]
MetaX build and operator tests:
[screenshot]
Iluvatar build and operator tests:
[screenshot]
Moore Threads build and operator tests:
[screenshot]

zhangyue207 force-pushed the feat/ascend-operators branch from 0d93135 to df07f95 on April 17, 2026 at 20:34
zhangyue207 and others added 11 commits on April 21, 2026 at 14:32
… operator split (InfiniTensor#64)

* fix(ci): tolerate docker teardown SIGKILL when pytest passes cleanly

Docker 18.09 occasionally SIGKILLs the container during its `chown`
teardown step, causing `.ci/run.py` to exit 137 even when pytest
completed normally. Parse `/workspace/results/test-results.xml` for
`errors` / `failures` fields and treat 137 as success when pytest
reports no failures.

Also bundles a small Dockerfile update for the Ascend image used by
`.ci/run.py`.
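
For reference, a minimal sketch of the guard described above (not `.ci/run.py`'s actual code; `container_run` and the exact report layout are assumptions):

```python
import xml.etree.ElementTree as ET

RESULTS_XML = "/workspace/results/test-results.xml"

def pytest_passed_cleanly(path: str = RESULTS_XML) -> bool:
    """True when the junit report exists and reports zero errors/failures."""
    try:
        root = ET.parse(path).getroot()
    except (OSError, ET.ParseError):
        return False
    # pytest emits either a bare <testsuite> or one nested under <testsuites>.
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    if suite is None:
        return False
    return suite.get("errors") == "0" and suite.get("failures") == "0"

exit_code = container_run()  # hypothetical wrapper around `docker run ...`
if exit_code == 137 and pytest_passed_cleanly():
    exit_code = 0  # the SIGKILL hit the teardown `chown`, not the tests
```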

* fix(scripts): align `py::arg` order with C++ lambda params + optional defaults

Two fixes in the pybind11 bindings generator:

1. `py::arg("implementation_index")` was emitted before `py::arg("stream")`
   in the generated `def(...)` call, but the C++ lambda parameters were
   declared in the opposite order. Kwargs then silently swapped — the
   stream integer landed in the impl-index slot, and dispatch SIGABRT'd.
   Re-order so `py::arg` entries are positional-consistent with the C++
   lambda signature.

2. Only `std::optional<Tensor>` parameters had a `= py::none()` default;
   `std::optional<int64_t>` (and other scalar optionals) had no default,
   forcing callers to pass them explicitly. Generalize the default
   emission to all `std::optional<...>` parameters.
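
A sketch of the combined fix on the generator side, assuming an illustrative `(name, cpp_type)` parameter model (the real generator's data structures may differ):

```python
def emit_py_args(params):
    """Emit `py::arg(...)` entries positionally consistent with the C++
    lambda signature, defaulting every `std::optional<...>` to `py::none()`."""
    parts = []
    for name, cpp_type in params:  # already in lambda declaration order
        arg = f'py::arg("{name}")'
        if cpp_type.startswith("std::optional<"):
            arg += " = py::none()"
        parts.append(arg)
    return ", ".join(parts)

# emit_py_args([("stream", "int64_t"),
#               ("implementation_index", "std::optional<int64_t>")])
# -> 'py::arg("stream"), py::arg("implementation_index") = py::none()'
```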

* feat(ascend): framework scaffolding + custom_kernel build infra

Framework headers shared across all Ascend operators:
- `common.h`: `AclTensorCache` descriptor-caching + `toAclDtype` helpers
- `workspace_pool_.h`: stream-scoped `WorkspacePool` with named arenas;
  `GetWorkspacePool()` / `Pool::Ensure()` entry points (matches master
  PR InfiniTensor#60 naming)
- `atb_common_.h`: ATB `Context` management + `toAtbTensor` helper for
  operators wrapping ATB APIs
- `data_type_.h`, `device_.h`: `TypeMap<Ascend, T>` + `Runtime` specialization
- `runtime_.h` is the existing file; left untouched by this PR

`custom_kernel/` ships the AscendC standalone build system for custom
kernels. Gated by its own `CMakeLists.txt`; produces
`libascend_kernel.so` consumed by `kernel_custom.h` op variants (landed
in follow-up category PRs).

* feat: core framework, build, and test infra for Ascend operator split

Shared changes needed by every Ascend operator PR:

- `src/hash.h` + `src/operator.h`: cache-key plumbing used by
  `Operator<Op, device>` dispatch
- `src/pybind11_utils.h`: tensor / optional-tensor / vector-tensor
  pybind11 casters used by the generator output
- `CMakeLists.txt` + `src/CMakeLists.txt`: Ascend build target, atb
  discovery, `WITH_ASCEND` option
- `tests/conftest.py`: `auto_act_and_assert` fixture + device
  parametrization (`--devices ascend/nvidia/...`)
- `tests/utils.py`: `Payload`, `randn_strided`, `get_npu_stream`, and
  similar test helpers shared by every `tests/test_<op>.py`

* test(conftest): auto-skip tests whose op has no impl on the target device

Adds a `skip_op_without_platform_impl` autouse fixture that derives the
InfiniOps class name from the test module filename
(`tests/test_<snake>.py` → `<Snake>`) and checks
`active_implementation_indices` for the parametrized device. When the
op has no backend specialization on the current branch, the test is
skipped instead of SIGABRTing through `Operator<Op, device>::Make()`.

This is essential for the operator split: each per-category branch
contains only its category's Ascend impls but inherits test files for
all operators from master. Without this guard, `pytest tests/
--devices ascend` crashes on ops lacking ascend impls on the branch.
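
A minimal sketch of such a fixture, assuming a `device` fixture that yields a torch device and the `infini.ops` import path used by the test helpers in this PR:

```python
import pytest
import infini.ops

@pytest.fixture(autouse=True)
def skip_op_without_platform_impl(request, device):
    # `tests/test_rms_norm.py` -> `RmsNorm`.
    snake = request.module.__name__.rpartition("test_")[2]
    op_name = "".join(part.capitalize() for part in snake.split("_"))
    op_cls = getattr(infini.ops, op_name, None)
    if op_cls is None:
        return  # module does not follow the `test_<op>.py` convention
    if not op_cls.active_implementation_indices(device.type):
        pytest.skip(f"`{op_name}` has no `{device.type}` impl on this branch")
```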

* chore(custom_kernel): drop perf/design docs and standalone .so pytest files

Remove content that duplicates what the pytest integration tests
(`tests/test_rms_norm.py`, `tests/test_add_rms_norm.py`) already
cover, or that's developer scratchpad rather than checked-in
artifact:

- `csrc/ops/rms_norm/{README,design}.md` — design scratch
- `csrc/ops/rms_norm/test/{benchmark_rms_norm_msprof,run_rms_norm_case}.py`,
  `rms_norm_cases.jsonl`, `rms_norm_perf_report.md`,
  `rms_norm-test-cases.md` — per-op perf benchmarking + reports
- `tests/test_{rms_norm,add_rms_norm}.py` under custom_kernel/ —
  redundant with the top-level pytest integration tests

Build infra, kernel sources, registration, and utility headers are
unchanged; the `libascend_kernel.so` artifact and its consumers
(`kernel_custom.h` variants in the op-norm-rope PR) are unaffected.

* style(scripts,custom_kernel): fix Python blank-line hygiene + drop redundant .gitignore entry

Review items 1-5 on `scripts/generate_wrappers.py`:
- Restore docstring quoting in `_find_optional_tensor_params` (reverts an
  accidental change to ``int`` and the double space).
- Restore blank lines before `return` in `_find_optional_tensor_params`,
  `_is_optional_tensor`, and `_generate_params` / `_generate_arguments`
  (project CLAUDE.md Python style: "blank line before `return` unless
  inside a block body").
- Add missing blank line before `return` in `_find_vector_tensor_params`
  and `_is_vector_tensor`.
- Drop redundant `import re` inside `_find_vector_tensor_params` — `re`
  is imported at module level.

Review item 10 on `src/ascend/custom_kernel/.gitignore`:
- Drop redundant `build/` entry (already ignored globally via the
  project-root `.gitignore`). Keep `output/` and `python/` — both are
  AscendC-specific build artifacts not covered by the root ignore.

* refactor(custom): rename `custom_kernel` → `custom` and flatten to match vllm-ascend/csrc layout

Reviewer top-level feedback on PR InfiniTensor#64: mirror the directory layout of
https://github.com/vllm-project/vllm-ascend/tree/main/csrc and drop the
extra nesting layers.

Directory changes:
- `src/ascend/custom_kernel/` → `src/ascend/custom/`
- Merge `csrc/` into the top: move `csrc/register.cpp`,
  `csrc/ops.h`, `csrc/utils/` up one level.
- Rename `register.cpp` → `torch_binding.cpp` to match vllm-ascend naming.
- Promote `csrc/ops/<op>/` to `<op>/` at the top (drop the `ops/` layer).
- Merge `csrc/CMakeLists.txt` content into top-level `CMakeLists.txt`;
  delete the now-empty `csrc/` layer.
- Remove `src/ascend/custom_kernel/.gitignore` (root `.gitignore`
  already ignores `build/`; `output/`+`python/` were custom_kernel-scoped
  build artifacts that fit the root gitignore's scope too).

Resulting layout:
  custom/
  ├── build.sh
  ├── CMakeLists.txt
  ├── cmake/{config_ascend,config_envs}.cmake
  ├── ops.h
  ├── torch_binding.cpp           (was `register.cpp`)
  ├── utils/torch_kernel_helper.h
  ├── rms_norm/{op_host,op_kernel}/rms_norm.cpp
  └── add_rms_norm/{op_host,op_kernel}/add_rms_norm.cpp

License preservation: files shared in structure/substance with
vllm-ascend (`torch_binding.cpp`, `ops.h`, `utils/torch_kernel_helper.h`,
top-level `CMakeLists.txt`) now carry proper Apache License 2.0 headers
with the original Huawei Technologies copyright preserved alongside
InfiniTensor's modification copyright.

Callers:
- `src/CMakeLists.txt`: `custom_kernel` → `custom` in two references.
- Root `CMakeLists.txt`: updated inline comment pointing to the build
  script.
- Library name (`ascend_kernel`), static lib (`no_workspace_kernel`),
  and Python module name remain unchanged — `kernel_custom.h` consumers
  in the op-norm-rope PR link via those identifiers, not by path, so
  this rename does not ripple into that branch.

CI: `.ci/run.py --local --gpu-id 0` reports 3072 passed / 1782 skipped on Ascend 910B
with `BUILD_CUSTOM_KERNEL=OFF` (default); the custom kernel build
itself is exercised by the op-norm-rope PR's `kernel_custom.h`
integration.

* style: fix PR InfiniTensor#64 review patterns found outside `custom/`

Scan-and-fix pass for patterns flagged in reviewer comments on
`custom_kernel/` that also appear in other files in this PR.

- `src/ascend/common.h`: wrap `aclTensor` in backticks in two comments
  (matches comment 9 on Markdown formatting in custom_kernel).
- `tests/utils.py`: add missing blank line before trailing `return` in
  `get_stream()` (matches comments 3/5 on missing blank line before
  return in non-block-body context).

No camelCase-local violations in the framework C++ headers
(atb_common_, common, data_type_, device_, workspace_pool_, hash,
operator, pybind11_utils) — reviewer comment 6 was specific to
`custom/` op_host code adapted from vllm-ascend.

* style(custom): address PR InfiniTensor#64 review comments 6+7 — C++ naming

Reviewer @voltjia on PR InfiniTensor#64 inline comments:
- Comment 6: local variables must follow Google C++ Style Guide
  (`dimLength` → `dim_length`, etc.). Applied across all locals in the
  two op_host files.
- Comment 7: namespace `ascend_kernel` is non-standard; use `detail` or
  `ascend::detail` to match other platforms. Renamed to
  `ascend::detail` in `ops.h`, `torch_binding.cpp`,
  `utils/torch_kernel_helper.h`, and both `op_host/*.cpp` files.

The library name (`ascend_kernel` → `libascend_kernel.so`), `OP_PLUGIN_NAME`,
and Python-import name are unchanged — those are compile/link identity
and are independent of the C++ namespace.  `kernel_custom.h` in
op-norm-rope links via the C `extern` launch symbol, not the namespace,
so this rename does not ripple into that branch.

Also took the opportunity to backtick-wrap identifiers in comments that
the rename touched.

Inline comments 8 and 9 (Markdown formatting in comments) were already
covered by the backtick pass in commit 0aed3a5 for non-custom files; the
custom/ comments here also get normalized as a side-effect of rewriting
the affected lines.

* review(pr#64): address remaining unresolved inline comments

Scanned ALL 30 inline comments on PR InfiniTensor#64 (not just the 10 visible in
collapsed view). 22 had been missed by the earlier passes.

Generator (scripts/generate_wrappers.py):
- Comments 8-10: swap `stream` and `implementation_index` in both the
  pybind lambda parameters and the `py::arg` declarations, to match
  the `Operator::Call(Handle, Config, ...)` order (Handle first, Config
  second). Previously ordered impl_index first for lambda-signature
  alignment; with the swap, both are reordered together so kwargs
  still resolve correctly.
- Comment 11: restore backticks around device names in `--devices` help
  text.
- Comment 12: `.def_static("clear_cache", ...)` kept — it is the API
  used by the new `_clear_operator_caches` pytest fixture.

CMakeLists.txt:
- Comments 13-14: wrap `NEEDED` and `torch_npu` in Markdown backticks in
  comments.

tests/conftest.py (comments 23-29):
- Reset the file to master's content and re-apply only the two new
  fixtures (`_clear_operator_caches`, `skip_op_without_platform_impl`)
  with Markdown docstrings (single backticks, not rST double). Reverts
  incidental changes to `pytest_addoption` help text,
  `skip_unsupported_dtypes` rename, `_PLATFORM_TO_TORCH_DEVICE` dict
  order, `_resolve_device` docstring, and the `torch_npu` comment
  line-wrap.
- Fix comment 27's concern: `_TORCH_DEVICE_TO_PLATFORMS` now maps one
  torch device type to multiple platforms (`cuda` →
  `{nvidia, metax, iluvatar}`) and `skip_op_without_platform_impl`
  checks `active_implementation_indices` across all of them; it skips
  only when every mapped platform reports empty.

tests/utils.py:
- Comment 16: remove `get_npu_stream`; `get_stream(device)` covers all
  torch device types.

tests/test_{add,causal_softmax,gemm,rms_norm,swiglu}.py:
- Comments 17-22: replace the `if device.type == "npu"` branches with a
  single call that passes `stream=get_stream(<tensor>.device)`. Single-
  line import restored in `test_add.py` (comment 22 — format minimization
  after dropping the `get_npu_stream` import).

test_gemm.py specifically: moved the "impl=2 on Ascend is broken because
of `src/torch/gemm/gemm.h` SFINAE pollution" workaround from the
helper-level conditional into a `pytest.skip` at the top of the test
body, so the helper itself becomes unconditional.
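
Roughly what the moved workaround looks like (the parameter list and helper name here are illustrative, not the test file's exact code):

```python
import pytest

def test_gemm(device, implementation_index, shape, dtype):
    if device.type == "npu" and implementation_index == 2:
        pytest.skip("impl=2 on Ascend: `src/torch/gemm/gemm.h` SFINAE pollution")
    run_gemm_case(device, implementation_index, shape, dtype)  # hypothetical helper,
    # now unconditional: no device-specific branching inside it
```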

* style: fix clang-format and ruff format violations

- `src/ascend/custom/utils/torch_kernel_helper.h`: clang-format wrapped
  a long `ConvertTypes` macro continuation.
- `tests/test_add.py`: ruff `format` wrapped the 5-import `tests.utils`
  line (89 chars, over the default 88 limit) back into multi-line
  form. Reviewer comment 22 suggested restoring a single line after
  dropping `get_npu_stream`, but with `get_stream` added the shortened
  form still exceeds the ruff line-length cap.

* style(comments): wrap remaining technical identifiers in Markdown backticks

Scan-and-fix pass for identifiers in comments that still lack Markdown
backticks, matching reviewer comments 9, 11, 13, 14 on PR InfiniTensor#64. Applied
only to files authored / modified by this PR (leaves custom/cmake/
config_envs.cmake and similar vllm-ascend-verbatim content untouched to
stay consistent with the upstream it was adapted from).

- `CMakeLists.txt`: `pybind11` (line 7).
- `src/ascend/common.h`: `shape`, `strides`, `storage_shape`, `dtype`
  in the `AclTensorCache` class doc.
- `src/ascend/custom/CMakeLists.txt`: `AscendC` toolchain reference.
- `src/ascend/custom/build.sh`: `AscendC`, `libascend_kernel.so`.
- `src/ascend/custom/cmake/config_ascend.cmake`: `SOC_VERSION`,
  `CANN`, `AscendC`.

* style(ascend): rename free-function helpers from camelCase to PascalCase

Per Google C++ Style Guide §Function Names: ordinary non-accessor
functions are PascalCase. Accessors/mutators (get/set on class members)
are snake_case. These 7 are standalone helpers / converters /
predicates — not member accessors — so they need PascalCase.

  threadLocalAtbContext  →  ThreadLocalAtbContext
  getAtbContext          →  GetAtbContext
  toAtbTensor (×2)       →  ToAtbTensor
  isAclRuntimeAlive      →  IsAclRuntimeAlive
  buildAclTensor         →  BuildAclTensor
  toAclDtype             →  ToAclDtype
  isIntegerDtype         →  IsIntegerDtype

CANN APIs (`aclrtGetDevice`, `aclCreateTensor`, …), STL/PyTorch interop
methods (`begin`/`end`/`size`/`data`/…), and class accessors
(`get_<field>`/`set_<field>`) are all kept as-is — they either belong
to another vendor or match the "looks like a variable" exception.

Callers on the three category branches (op-simple / op-norm-rope /
op-cache-attn) will pick up the new names automatically on rebase.

* style(ascend): reformat CMake comments + restore `Gemm`/`MatMul` backticks in common.h

- `src/ascend/custom/cmake/config_envs.cmake`: capitalize + period + Markdown backticks on all comments and status messages.
- `src/ascend/custom/cmake/config_ascend.cmake`: fix `CANN` casing and backticks in the fatal-error message.
- `src/ascend/custom/CMakeLists.txt`: polish status messages and inline comments (Markdown backticks + sentence case).
- `src/ascend/common.h`: restore `Gemm` and `MatMul` backticks in the `BuildAclTensor` docstring per PR InfiniTensor#64 review.

* refactor(ascend): address PR InfiniTensor#64 review — clean headers, Markdown in `TORCH_CHECK`, Google C++ naming

- `workspace_pool_.h`: uncomment `<cinttypes>` / `<cstdio>` (needed for `PRIu64` and `fprintf` in the destructor; not transitively available on all platforms).
- `device_.h`: switch relative `../device.h` to absolute `device.h` — the historical `src/ascend/device.h` naming collision is no longer relevant.
- `custom/{add_rms_norm,rms_norm}/op_host/*.cpp`: drop unneeded BSD-3-Clause headers and switch `TORCH_CHECK` messages to Markdown-backticked identifiers.
- `custom/{add_rms_norm,rms_norm}/op_kernel/*.cpp`: drop unneeded BSD-3-Clause headers.
- Rename wrapper functions to PascalCase per Google C++ Style: `add_rms_norm` → `AddRmsNorm`, `rms_norm` → `RmsNorm` (ops.h + torch_binding.cpp updated; `torch.ops.npu.rms_norm` registry name unchanged; kernel entry-point names stay snake_case as required by `EXEC_KERNEL_CMD`).

---------

Co-authored-by: zhangyue <zhangyue@example.com>
…near (InfiniTensor#65)

* feat(ascend): op-simple group — Add, Mul, Cast, Cat, Matmul, Gemm, Linear

Seven foundational Ascend operators:

| op | impl |
|---|---|
| Add | aclnnAdd |
| Mul | aclnnMul |
| Cast | aclnnCast |
| Cat | aclnnCat |
| Matmul | aclnnMatmul |
| Gemm | aclnnMm (also carries the cached-executor / workspace-pool rework) |
| Linear | aclnnMatmul + optional bias |

Also ships:
- `src/base/<op>.h` for the 5 new ops (cast/cat/linear/matmul/mul);
  `add.h` and `gemm.h` existed on master and are updated in-place
- `src/cpu/<op>/<op>.h` reference impls for cast/cat/linear/mul (add/gemm/matmul
  had CPU refs on master already)
- `tests/test_<op>.py` for each operator (add and gemm have MODIFY diffs;
  others are new)

* fix(ascend): Add/Cat destructor — use `release()` for executor-owned caches

- `add/kernel.h`: swap destroy() → release() on in_cache_/oth_cache_/out_cache_
  and drop aclDestroyAclOpExecutor (both are referenced by the Repeatable
  executor; destroying them causes double-free at shutdown per the pattern
  documented in common.h and commit 64c367c).
- `cat/kernel.h`: release all in_caches_[i] in the destructor; without it,
  ~AclTensorCache() on vector teardown double-frees descriptors held by
  tensor_list_ / executor_.
- Also group the alpha_* storage members with blank lines to match file
  convention.

* test: generate `implementation_index` dynamically from `active_implementation_indices`

Replaces hardcoded `(0, 1)` / `(0, 1, 2)` tuples in test_add, test_gemm,
test_rms_norm, test_swiglu with a union over the locally-available devices'
active implementation indices.

New helper `tests.utils.all_active_implementation_indices(op_cls)` only
iterates `get_available_devices()` to avoid `DispatchFunc::std::abort` on
device types outside the build's `ActiveDevices` set.

Effect on Ascend CI: skipped-test count drops from 3246 to 1686 — impl=1
(`cuBLASLt`) no longer parametrized when no CUDA device is visible, and
RmsNorm/Swiglu's custom-kernel slot drops out of the matrix on op-simple
where the framework layer hasn't merged the AscendC impl yet.
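
A sketch of that helper under the constraint above (exact signature assumed; it is superseded by the joint parametrize in the next commit):

```python
def all_active_implementation_indices(op_cls):
    """Union of impl indices over locally-available devices only, so we never
    query device types outside the build's `ActiveDevices` set."""
    indices = set()
    for dev in get_available_devices():
        indices.update(op_cls.active_implementation_indices(dev))
    return tuple(sorted(indices))

# @pytest.mark.parametrize("implementation_index",
#                          all_active_implementation_indices(infini.ops.Add))
```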

* test(conftest): joint `(device, implementation_index)` parametrize

Replaces the per-test `@pytest.mark.parametrize("implementation_index", ...)`
+ runtime `if impl not in active_indices: skip` pattern with a single hook in
`conftest.pytest_generate_tests` that emits only the (device, impl) pairs
actually active on each device.

Rationale: kernel dispatch is per-device, so cross-device union (previous
`all_active_implementation_indices` helper) polluted the matrix with impls
that the selected device can't run — runtime-skipped noise.  Joint generation
keeps the matrix to its semantic cell: "this device has this impl, so run it".

- `tests/conftest.py`: when both `device` and `implementation_index` are in
  fixturenames, emit pairs via `op_cls.active_implementation_indices(dev)`;
  fall back to a skipped placeholder (`id="skip"`) when no device has an
  active impl, avoiding `[NOTSET-...]` test IDs.
- `tests/{test_add,test_gemm,test_rms_norm,test_swiglu}.py`: drop the hardcoded
  `implementation_index` parametrize decorator and the runtime `active_indices`
  guard — conftest now handles both.
- `tests/utils.py`: remove the `all_active_implementation_indices` helper
  (superseded by per-device generation in conftest).

Same test outcome on Ascend CI (1935 passed / 1686 skipped) but the remaining
skips are now either semantically mandatory (uint dtypes unsupported by
`torch_npu`, Gemm impl=2 SFINAE-only workaround, op missing ascend impl on
op-simple pending PR InfiniTensor#66) rather than mechanism artifacts.
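
A hedged sketch of the joint-generation hook (`_op_class_from_module` is the helper named in the follow-up commit; the real conftest wiring may differ):

```python
import pytest

def pytest_generate_tests(metafunc):
    if not {"device", "implementation_index"} <= set(metafunc.fixturenames):
        return
    op_cls = _op_class_from_module(metafunc.module)  # snake -> Pascal helper
    pairs = [
        (dev, impl)
        for dev in get_available_devices()
        for impl in op_cls.active_implementation_indices(dev)
    ]
    if not pairs:  # avoid `[NOTSET-...]` IDs when no device has an active impl
        pairs = [pytest.param(None, None, id="skip",
                              marks=pytest.mark.skip("no active implementation"))]
    metafunc.parametrize(("device", "implementation_index"), pairs)
```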

* refactor(conftest): dedupe `_op_class_from_module`, short-circuit redundant fixture

Post-review cleanup of the joint-parametrize refactor (1dd288f):

- Extract `_op_class_from_module` as a shared helper; `skip_op_without_platform_impl` fixture now calls it instead of re-deriving the snake→pascal class name inline.
- Short-circuit the fixture when `implementation_index` is already in callspec — `pytest_generate_tests` has already pruned empty-impl pairs, so per-case `active_implementation_indices` calls are wasted work.
- Drop `try/except ImportError` inside the helper — collection has already imported `infini.ops` via test modules; masking a real import failure only turns it into a cryptic NOTSET fixture.
- Drop the `devices[0] if devices else "cpu"` fallback — `get_available_devices()` always includes `"cpu"`, making the `else` arm unreachable.

* refactor(cpu): flatten nested `DispatchFunc` in Cast; snake_case variables in Linear

Per PR InfiniTensor#65 review:

- `src/cpu/cast/cast.h`: replace nested `DispatchFunc(in_dtype, ...)` inside
  `DispatchFunc(out_dtype, ...)` with a single multi-dispatch call
  `DispatchFunc<kCpu, AllTypes, AllTypes>({in, out}, [](in_tag, out_tag) {...})`
  per the multi-dispatch idiom documented in `CONTRIBUTING.md`.
- `src/cpu/linear/linear.h`: rename PascalCase locals to snake_case:
  `A/B/Out/Bias` → `a_ptr/b_ptr/out_ptr/bias_ptr`,
  `A_batch/B_batch/Out_batch` → `a_batch/b_batch/out_batch`,
  `M/N/K` → `m/n/k` (matching master's `src/cpu/gemm/gemm.h` which already
  uses lowercase dim names `m_/n_/k_`).

* refactor(cpu/linear): drop redundant `&& bias` guard + narrating comment

- `if (bias_ptr && bias)` → `if (bias_ptr)` (line 75). `bias_ptr` is
  `nullptr` iff `!bias` by construction at line 38, so `&& bias` is dead.
- Remove the comment "// Determine m, n, k from shapes and transpose flags." —
  the three lines below literally do exactly that; self-describing now that
  names are snake_case.

---------

Co-authored-by: zhangyue <zhangyue@example.com>
- pass `-std=c++17` through `CMAKE_CUDA_FLAGS` for Iluvatar clang builds

Co-authored-by: zhuyue <zhuyue@qiyuanlab.com>
…nfiniTensor#70)

* chore(lint): add .clang-tidy for Google-style naming enforcement

`clang-format` only enforces whitespace/braces/include order — naming
violations (`BUFFER_NUM`, `dimLength`, `inQueueX1`, missing private-member
trailing `_`, etc.) pass silently.  This PR adds `clang-tidy` with
`readability-identifier-naming.*` wired to the Google C++ Style Guide so
the `code-lint` skill can catch them.

- `.clang-tidy` at repo root: types `PascalCase`, functions `PascalCase`,
  variables / parameters `snake_case`, private members `snake_case_`,
  constants `kPascalCase`, macros `UPPER_CASE`, namespaces `lower_case`.
  Only `readability-identifier-naming.*` is `WarningsAsErrors`; the
  `google-*` / `modernize-*` checks are advisory.
- `src/ascend/custom/.clang-tidy`: relaxes `FunctionCase` to `lower_case`
  because `ascendc_add_operator(OP_NAME …)` dictates snake_case kernel
  entry symbol names that cannot be `PascalCase`d.
- `src/ascend/custom/rms_norm/op_kernel/.clang-tidy`: disables all checks
  for device code compiled by `ccec` (absent from `compile_commands.json`,
  `__aicore__` macro parses incorrectly without `kernel_operator.h`).
- `pyproject.toml`: turns on `CMAKE_EXPORT_COMPILE_COMMANDS` so every
  editable `pip install` emits `compile_commands.json` for `clang-tidy`.
- `src/device.h`: adds missing `<string>` / `<string_view>` includes —
  pre-existing transitive-include bug surfaced by `clang-tidy`'s stricter
  parsing.

* chore(pr70-review): address review comments

- `pyproject.toml`: wrap `scikit-build` in backticks; insert blank line
  between build-related defines and tool-related defines.
- `.clang-tidy`: rewrite section divider comments as complete sentences
  ending in a period, per project convention.

---------

Co-authored-by: zhangyue <zhangyue@example.com>
… RmsNorm, AddRmsNorm, ApplyRotaryPosEmb, RotaryEmbedding (InfiniTensor#66)

* feat(ascend): op-norm-rope group — Swiglu, SiluAndMul, CausalSoftmax, RmsNorm, AddRmsNorm, ApplyRotaryPosEmb, RotaryEmbedding

Seven layer-level Ascend operators:

| op | impl |
|---|---|
| Swiglu | aclnnSilu + aclnnMul (decomposed); `kernel_fused.h` wraps fused swiglu where available |
| SiluAndMul | custom AscendC kernel |
| CausalSoftmax | aclnnSoftmax + pre-computed mask |
| RmsNorm | aclnnRmsNorm (kernel.h); custom AscendC variant (kernel_custom.h) |
| AddRmsNorm | 3 impls: decomposed aclnnAdd+aclnnRmsNorm (kernel.h); fused aclnnAddRmsNorm (kernel_fused.h); custom AscendC (kernel_custom.h) |
| ApplyRotaryPosEmb | aclnnApplyRotaryPosEmbV2 (kernel.h); ATB RopeParam (kernel_atb.h) |
| RotaryEmbedding | **3 impls**: aclnnApplyRotaryPosEmbV2 (kernel.h); ATB RopeParam with both neox/interleave (kernel_atb.h); aclnnRopeWithSinCosCache for partial rotary (kernel_sincos_cache.h) |

Bundles the RotaryEmbedding API alignment: `query_out` / `key_out`
are now `std::optional<Tensor>` — omitted → inplace on `query` / `key`
(matches vLLM `RotaryEmbedding.forward(positions, query, key)`).

New `src/base/<op>.h`: apply_rotary_pos_emb, silu_and_mul.
Modified: add_rms_norm (constructor signature alignment),
rotary_embedding (optional query_out/key_out).

* fix(ascend): norm/swiglu destructors + missing add_rms_norm custom kernel registration

- swiglu/kernel_fused.h: release() cat_out_cache_ and out_staging_cache_
  to avoid double-free; drop aclDestroyTensorList per 64c367c convention.
- silu_and_mul/kernel.h: release() out_staging_cache_ to avoid double-free.
- custom/CMakeLists.txt: add add_rms_norm sources to OP_SRCS and register
  its op_kernel via ascendc_library(no_workspace_kernel ...); without
  this, aclrtlaunch_add_rms_norm has no backing implementation.

* style(ascend): rename `AddRmsNorm` parameters to PyTorch-aligned names

- `x1/x2/gamma/y_out/x_out` -> `input/other/weight/out/rstd_out`.
- Propagate through base header, all three Ascend kernel variants
  (`kernel.h`, `kernel_fused.h`, `kernel_custom.h`), and test file.
- Remove stale `rstd_shape_` field from base (unused; `kernel.h` holds
  its own copy).
- Upgrade assertion messages to complete sentences with backticked
  identifiers.

* style(ascend): comment + assert message audit for norm/swiglu/softmax kernels

- Wrap `aclnn*` / `aclrt*` identifiers in backticks and ensure
  complete-sentence, period-terminated comments per CONTRIBUTING.md.
- `silu_and_mul` base header: upgrade assertion message to a
  complete sentence with backticked identifiers.
- Files touched: causal_softmax/kernel.h, rms_norm/kernel.h,
  swiglu/kernel.h, swiglu/kernel_fused.h, base/silu_and_mul.h.

* test(silu_and_mul): add `implementation_index` parametrize and strided coverage

- Wire `implementation_index` into joint `(device, implementation_index)`
  parametrize via conftest; enforces fixture symmetry with `test_swiglu.py`.
- Add two non-contiguous shape cases to exercise the staging-buffer copy
  path in `src/ascend/silu_and_mul/kernel.h`.

* refactor(ascend/rotary_embedding): unify RotaryEmbedding and ApplyRotaryPosEmb base ops

Merge the two rope base headers into one vLLM-compatible op matching
`RotaryEmbedding.forward(positions, query, key=None) -> (query, key|None)`.
`key` becomes `std::optional<Tensor>` (MLA), `query_out` / `key_out` remain
optional for the vLLM-native inplace path, and a new `bool pre_gathered`
constructor flag folds the old `ApplyRotaryPosEmb` fast path into the
unified op.

Kernel updates across all three Ascend impls:
- impl 0 (`aclnnApplyRotaryPosEmbV2`) and impl 1 (ATB `RopeParam`) accept
  the optional `key` / out tensors and honor `pre_gathered` (skipping
  internal `aclnnIndexSelect` when the caller has pre-gathered).
- impl 0 and impl 1 re-upload the expanded cos/sin tables on cache-pointer
  change (reviewer-flagged stale-pointer bug).
- impl 2 (`aclnnRopeWithSinCosCache`) destroys its per-call
  `aclOpExecutor` instead of leaking it (reviewer-flagged leak).
- Uppercase locals (`D`, `T`, `Nq`, `Nkv`, `half_D`, `hiddenQ`,
  `hiddenK`) renamed to snake_case, and `uploadCosSinCache` renamed to
  `UploadCosSinCache` per Google C++ style.

* feat(scripts/generate_wrappers): emit `apply_rotary_pos_emb` Python shim

After the `ApplyRotaryPosEmb` base class was folded into the unified
`RotaryEmbedding` op, vllm-infini still calls
`infini.ops.apply_rotary_pos_emb(...)` — preserve that symbol as a
pybind11 Python-level shim bound alongside the generated
`rotary_embedding` binding.

The shim un-expands the caller's neox-duplicated `[T, head_size]` cos /
sin halves, concats into a `[T, head_size*2]` pre-gathered cache,
synthesizes `positions = arange(T)`, and forwards to the unified op
with `pre_gathered=True`.  No vllm-infini changes are needed.

* test(rotary_embedding): merge apply_rotary_pos_emb cases + cover MLA/3D/partial

Consolidate `test_apply_rotary_pos_emb.py` (deleted separately) into
`test_rotary_embedding.py`:

- `test_apply_rotary_pos_emb`      — pre-gathered fast path through the
  new Python shim; asserts bit-exact parity against
  `infini.ops.rotary_embedding` on the same data.
- `test_apply_rotary_pos_emb_3d`   — 3D `[T, Nq, D]` / `[T, Nkv, D]`
  layout through the shim (reviewer gap).
- `test_rotary_embedding_partial`  — extend to cover
  `is_neox_style=False` on impl 2 (`aclnnRopeWithSinCosCache`),
  matching the reviewer's partial-rotary gap on the non-neox path.
- `_ref_rotary_embedding` now tolerates `key=None` (MLA).

* fix(generate_wrappers): propagate scalar param defaults to pybind signature

Without this, the unified `RotaryEmbedding`'s new `bool pre_gathered`
parameter became a required positional kwarg on the Python side, breaking
every existing `infini.ops.rotary_embedding(...)` caller that did not
pass it.  Regex-scan the base header for `<scalar_type> name = <literal>`
patterns and emit `py::arg(name) = <literal>` in `_generate_py_args`.

Also restore the default on the virtual `operator()` override in
`src/base/rotary_embedding.h` so the regex picks it up.
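
A sketch of such a regex scan (the actual pattern in `scripts/generate_wrappers.py` may differ):

```python
import re

# Matches e.g. `bool pre_gathered = false` or `int64_t dim = -1` in a header.
SCALAR_DEFAULT = re.compile(
    r"\b(bool|int|int64_t|float|double)\s+(\w+)\s*=\s*([\w.eE+-]+)")

def scalar_defaults(header_text):
    """Map parameter name -> default literal, for `py::arg(name) = literal`."""
    return {name: lit for _type, name, lit in SCALAR_DEFAULT.findall(header_text)}

# scalar_defaults("RotaryEmbedding(..., bool pre_gathered = false)")
# -> {"pre_gathered": "false"}
```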

* fix(ascend/rotary_embedding): correct pre-gathered layout + revert sincos executor destroy

Two in-flight regressions from the previous commit:

1. The `pre_gathered=true` path in kernel.h / kernel_atb.h assumed the
   caller's `cos_sin_cache` is `[T, head_size*2]` (dim-1 concat), but
   that layout can't be split with a flat byte offset because row-major
   contiguous layout interleaves cos and sin per row.  Change the wire
   format to `[2T, head_size]` (dim-0 concat) so the first
   `T * head_size * elem_sz` bytes are contiguous cos and the next
   are contiguous sin; update both kernels and the `apply_rotary_pos_emb`
   Python shim to match.

   Also set the initial `sin_v2_cache_` base pointer to the sin offset
   so the V2 executor captures distinct cos/sin addresses on first call.

2. `kernel_sincos_cache.h` (impl 2) SIGABRTs when the per-call
   `aclOpExecutor*` is destroyed right after `aclnnRopeWithSinCosCache`
   — the kernel is async on the stream and the executor backs the
   enqueued launch.  Revert the `aclDestroyAclOpExecutor` call (still
   leaks, but matches the prior behavior that passed all partial-rotary
   tests) and leave a TODO for proper Repeatable-executor caching once
   the input-address index layout for this kernel is confirmed.

* test(rotary_embedding): fix GPT-J reference for partial rotary

The GPT-J-style branch in `_ref_rotary_embedding` indexed `x[t, :, 0::2]`
and `x[t, :, 1::2]` across the full `head_size` — correct only when
`rotary_dim == head_size`.  For partial rotary, only the first
`rotary_dim` features rotate; restrict slices to `0:R:2` and `1:R:2`.
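
The corrected slicing, as a standalone sketch (not the test file's exact helper; `cos`/`sin` are assumed broadcastable to the `[rotary_dim // 2]` pair dimension):

```python
import torch

def ref_gptj_rotate(x, cos, sin, rotary_dim):
    """Rotate only the first `rotary_dim` features; pass the tail through."""
    out = x.clone()
    r = rotary_dim
    x1, x2 = x[..., 0:r:2], x[..., 1:r:2]  # interleaved GPT-J pairs
    out[..., 0:r:2] = x1 * cos - x2 * sin
    out[..., 1:r:2] = x1 * sin + x2 * cos
    return out
```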

* refactor(pr66-simplify): correct `rstd_out` semantic name + clarity fixes

Post-merge /simplify review findings applied:

- **`AddRmsNorm` param rename** (`src/base/add_rms_norm.h` + 3 Ascend kernels + test):
  `rstd_out` → `residual_out`.  The slot actually holds `xOut` (the
  `input + other` residual sum) per `aclnnAddRmsNorm`'s API — the internal
  `rstd_tensor_` reciprocal-std buffer is private.  Prior name was
  misleading.
- **Generator shim for `apply_rotary_pos_emb`** (`scripts/generate_wrappers.py`):
  rename the `head_size`-as-`rotary_dim` positional forward to a named local
  `rotary_dim_shim` + comment noting the legacy shim assumes full rotary
  (`rotary_dim == head_size`).
- **`kernel_sincos_cache.h` leak comment**: TODO → FIXME with persistent-worker
  impact call-out.  Actual fix still blocked on undocumented input-address
  index layout for `aclnnRopeWithSinCosCache`.

Skipped findings: reviewer false positives on `src/base/rotary_embedding.h`
members (all consumed by kernels) and `max_seq_len_` (used in constructor
body).  Larger refactors (`UploadCosSinCache` + `IndexSelect` helpers, ~100
lines of copy-paste) deferred to a follow-up PR.

* style(tests): ruff format `test_add_rms_norm.py` after `residual_out` rename

* build(ascend-custom): drive `build.sh` from `pip install` with proper dep tracking

In-tree `ascendc_library()` trips a `CANN` `extract_host_stub.py` path
bug (`KeyError` on `/./workspace/...` paths in `$<TARGET_OBJECTS>`)
whenever it runs under `scikit-build-core`'s temp-dir builds.  Standalone
`src/ascend/custom/build.sh` avoids the bug by invoking a separate
`cmake` with `src/ascend/custom/` as its `SOURCE_DIR`.  This commit
drives `build.sh` from the main build so devs / CI get a working install
from a single `pip install` call.

- `option(BUILD_ASCEND_CUSTOM …)` replaces the old `BUILD_CUSTOM_KERNEL`
  (name is Ascend-specific now that the driver is CMake-native) and
  **defaults to ON**.  Non-Ascend builds ignore it (gated by
  `WITH_ASCEND` in `src/CMakeLists.txt`); users who don't want the
  `ccec` build on Ascend pass `-DBUILD_ASCEND_CUSTOM=OFF`.

- `src/CMakeLists.txt` registers `build.sh` as a build-phase
  `add_custom_command(OUTPUT …/libno_workspace_kernel.a)` with explicit
  dependencies on every `src/ascend/custom/**/*.{cpp,h}` file (via
  `file(GLOB_RECURSE … CONFIGURE_DEPENDS)`) — edits to any `op_host/` or
  `op_kernel/` source now re-trigger the build, instead of silently
  reusing a stale `.a`.  The outer `scikit-build-core` env (`CMAKE_GENERATOR`,
  `CMAKE_EXPORT_COMPILE_COMMANDS`, …) is scrubbed via `cmake -E env
  --unset=…` before invoking `build.sh` — leaving them set makes the
  nested `cmake`'s `ninja` generator emit the bug-triggering
  `/./workspace/...` paths even though the outer configure dir is clean.

- `src/ascend/custom/cmake/detect_soc.cmake` holds
  `infiniops_detect_soc(<out>)`, which parses `npu-smi info` for the
  first `910*` / `310*` entry and falls back to `Ascend910B4`.  Both
  `src/CMakeLists.txt` (outer build) and
  `src/ascend/custom/cmake/config_ascend.cmake` (sub-build driven by
  `build.sh`) `include()` this file — SOC detection lives in one place.

- `src/ascend/custom/CMakeLists.txt` pushes the main `src/` dir onto
  the interface target's `INCLUDES` property so the kernel TU can
  `#include "data_type.h"`.

- `src/ascend/custom/add_rms_norm/op_kernel/.clang-tidy`: disables all
  `clang-tidy` checks on `ccec`-compiled device code (absent from
  `compile_commands.json`, `__aicore__` macro parses incorrectly
  without `kernel_operator.h`).

Dev workflow: `pip install -e .[dev]` gives a fully working install on
Ascend; editing any custom-kernel source and re-running `pip install`
re-triggers the `ccec` build automatically.

* refactor(data_type): pin `DataType` enum values explicitly

The `AscendC` custom kernels forward `static_cast<int64_t>(input.dtype())`
to their `aclrtlaunch_*` entry points and dispatch on the same enum —
making `DataType`'s integer values part of a host↔device ABI.

Assign explicit values (`kInt8 = 0, …, kFloat64 = 11`) to pin that ABI:
reordering or inserting entries above existing ones would silently
change the integers seen by device code.  No behaviour change at call
sites (the enum is still accessed by symbolic name everywhere except
the `int64_t` cast).

* feat(ascend-custom): add bf16 support + Google-style identifier renames

bf16 was silently producing garbage / NaN on impl 1 (`rms_norm`) and
impl 2 (`add_rms_norm`): the kernels only instantiated `<half>` and
`<float>`, and the launchers mapped bf16 to the fp32 byte-size path,
so bf16 weight was read as if it were fp32 and the fp16 output cast
used `CAST_ROUND` (fp16-only alias).

Kernel dispatch:

- `op_kernel/rms_norm.cpp` / `op_kernel/add_rms_norm.cpp`: add a
  `KernelXxx<bfloat16_t>` instantiation; dispatch in the `extern "C"`
  entry is now `switch (static_cast<infini::ops::DataType>(dtypeCode))`
  (shared enum forwarded from the launcher via `int64_t`).  The
  fp16/bf16 branch uses `CAST_RINT` for the fp32 → T writeback —
  defined for both `half` and `bfloat16_t` destinations, whereas
  `CAST_ROUND` is a `half`-specific alias.

Launchers (`kernel_custom.h`):

- Store `DataType dtype_` (replaces the old `int64_t dtype_size_` which
  collapsed fp16 and bf16 onto the same code).
- Use `ascend::ToAclDtype(dtype_)` and `kDataTypeToSize.at(dtype_)`
  instead of hand-rolled ternaries (consistent with the rest of the
  Ascend backend).
- Forward `static_cast<int64_t>(dtype_)` as the kernel's `dtypeCode`.
- `extern "C" aclrtlaunch_*` forward-decl parameters renamed to
  `snake_case`; the function name itself is generated by
  `ascendc_add_operator(OP_NAME …)` and carries
  `// NOLINTNEXTLINE(readability-identifier-naming)` so `clang-tidy`
  accepts it.

Identifier naming (Google C++ Style):

- `op_kernel/*.cpp` members `snake_case_`, params / locals `snake_case`,
  constants `kPascalCase` (was `BUFFER_NUM` / `dimLength` / `inQueueX1`
  / `blockRows`, etc. — inherited from the `vllm-ascend` sample style).

Verified: `pytest tests/test_rms_norm.py tests/test_add_rms_norm.py
--devices ascend` → 144 passed / 0 failed (fp32 / fp16 / bf16 × both
ops × full shape + stride matrix).

* refactor(base): align Linear/SiluAndMul/AddRmsNorm/RotaryEmbedding with vLLM

Bring `src/base/*.h` interfaces and tensor conventions into strict alignment
with vLLM's public kernel contracts.  Derived Ascend kernels and tests
follow.  `generated/bindings/` will regenerate on next build.

- **`SiluAndMul`**: rename `x` → `input` (matches `F.glu(input, dim)`); add
  `(input, out)` overload with `dim = -1` default to match vLLM's hardcoded
  last-dim behavior.
- **`Linear`**: add vLLM-aligned `(input, weight, bias?, out)` overload with
  weight stored as `[out_features, in_features]` (identical to
  `F.linear(input, weight, bias)`).  Deprecated 6-arg
  `(a, b, bias, trans_a, trans_b, out)` form retained.  CPU and Ascend
  subclasses gain matching 4-arg ctors that delegate to the 6-arg form with
  `trans_a = false, trans_b = true`.
- **`AddRmsNorm`**: rename `other` → `residual` (matches vLLM's
  `fused_add_rms_norm(input, residual, weight, eps)` schema); add inplace
  `(input, residual, weight, eps)` overload that forwards to the
  out-of-place primary form with aliased buffers.
- **`RotaryEmbedding`**: reorder first six parameters to match vLLM's
  `rotary_embedding(positions, query, key?, head_size, cos_sin_cache,
  is_neox)` schema verbatim; `rotary_dim` / `query_out?` / `key_out?` /
  `pre_gathered` remain as InfiniOps extensions at the tail.  Added
  `positions.dtype() == int64` assert per vLLM convention.

Verified on NPU: `pytest tests/test_{silu_and_mul,add_rms_norm,rotary_embedding,linear}.py --devices ascend` → 295 passed, 4 skipped, 0 failed.
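
A hedged usage sketch of the resulting Python-side call (keyword names assumed; with `query_out` / `key_out` omitted, the op updates `query` / `key` in place):

```python
infini.ops.rotary_embedding(
    positions,       # int64 token positions (asserted, per vLLM convention)
    query,
    key,             # optional; None is reserved for future MLA support
    head_size,
    cos_sin_cache,
    is_neox,         # the vLLM-verbatim schema ends here ...
    rotary_dim=rotary_dim,   # ... InfiniOps extensions sit at the tail
    pre_gathered=False,
)
```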

* refactor(base): trim narrative comments and collapse CPU Linear ctors

Follow-up to `c23901a`.  Per CLAUDE.md "default to writing no comments",
strip doc-comments that narrate the change or restate well-named
identifiers from the four refactored base headers.  Keep only the one
WHY comment in `rotary_embedding.h` explaining `pre_gathered`'s
index_select+neox precondition (the name alone doesn't carry it).

Also replace the two delegating ctors in `src/cpu/linear/linear.h` with
`using Linear::Linear;` — matches the pattern already used in
`src/cpu/{rms_norm,swiglu}/*.h`, `src/cuda/{rms_norm,causal_softmax}/*.h`.

Verified: `pytest tests/test_{silu_and_mul,add_rms_norm,rotary_embedding,linear}.py --devices ascend` → 295 passed, 4 skipped.

* fix(pr66-review): address review findings 1-3

- `tests/test_add_rms_norm.py`: extend `implementation_index` parametrize
  to `(0, 1, 2)`; add a `_clear_add_rms_norm_cache` autouse fixture
  (sketched after this list) to avoid cross-test state pollution in the
  custom AscendC kernel (impl 2), whose cached fp32 weight buffer collides
  across tests with matching shape/dtype keys.  Coverage: +54 test cases
  (108 total, all green).

- `src/base/rotary_embedding.h`: assert `key.has_value()` with a TODO
  noting MLA is not yet implemented on any Ascend backend.  All three
  impls already assert `has_key_` individually; hoisting the check to
  base turns a silent crash (if a caller passes `key=None`) into a clean
  assert.  Keeps `std::optional<Tensor> key` in the signature for future
  MLA support without breaking vLLM API compatibility.

- `src/ascend/causal_softmax/kernel.h`: add justification for the
  3-primitive decomposition (no single CANN 8.5 API covers causal-mask
  + softmax; `aclnnSoftmaxV2` lacks the mask argument, and
  `aclnnScaledMaskedSoftmax` requires a pre-scaled attention score), per
  CLAUDE.md Ascend rule "never decompose when a fused API exists".

Verified: `pytest tests/test_{silu_and_mul,add_rms_norm,rotary_embedding,linear,causal_softmax}.py --devices ascend` → 349 passed, 4 skipped.
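
A minimal sketch of the cache-clearing fixture from the first item above, assuming the `clear_cache` static bound by the generator (the `.def_static("clear_cache", ...)` kept in the earlier review pass); the exact call shape is an assumption:

```python
import pytest
import infini.ops

@pytest.fixture(autouse=True)
def _clear_add_rms_norm_cache():
    yield
    # Drop cached executors so the impl-2 fp32 weight buffer can't collide
    # across cases that share a shape/dtype cache key.
    infini.ops.AddRmsNorm.clear_cache()
```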

* refactor(pr66): drop `apply_rotary_pos_emb` wrapper + tests

The legacy `apply_rotary_pos_emb` shim existed only as a vllm-infini
compat alias after the `ApplyRotaryPosEmb` base op was folded into the
unified `RotaryEmbedding`.  vllm-infini is out of scope for this PR, so
drop the shim entirely:

- `scripts/generate_wrappers.py`: remove `_generate_apply_rotary_pos_emb_shim`
  and the `extra_shim` emission hook — the Python-level wrapper was
  ~60 lines of pybind C++ that concatenated cos/sin, synthesized
  `positions = arange(T)`, and forwarded to `rotary_embedding` with
  `pre_gathered=True`.  Callers that need the pre-gather fast path can
  invoke `infini.ops.rotary_embedding(..., pre_gathered=True)` directly.
- `tests/test_rotary_embedding.py`: remove `test_apply_rotary_pos_emb` /
  `test_apply_rotary_pos_emb_3d` and the `_expand_cos_sin` helper that
  only those tests used.  `pre_gathered=True` remains exercised
  indirectly via `test_rotary_embedding_full` when impl 2 requires the
  caller to pre-gather (handled internally by the kernel).
- Touch up two stale `apply_rotary_pos_emb shim` comments in
  `kernel{,_atb}.h` that no longer point anywhere.

Verified: `pytest tests/ --devices ascend` → 2278 passed, 1612 skipped
(was 2306 / 1612 — delta is the 28 removed `apply_rotary_pos_emb` cases).

* test(rotary_embedding): add `pre_gathered=True` coverage

Fold the deleted `test_apply_rotary_pos_emb` / `_3d` cases into a single
`test_rotary_embedding_pre_gathered` that exercises the `pre_gathered`
fast path directly on the `rotary_embedding` overload (no shim).
Parametrize over 2D / 3D query-key layouts, impls 0 and 1 (impl 2 asserts
`!pre_gathered_`), neox / GPT-J styles, fp16 / bf16.  The new
`_build_pre_gathered_cache` helper constructs the `[2*T, head_size]`
wire format that `src/ascend/rotary_embedding/kernel.h` expects —
cos rows 0..T-1, sin rows T..2T-1, both neox-expanded per token.

Coverage: 12 new cases pass (4 skip for `impl=0 + not-neox`, same as the
`test_rotary_embedding_full` skip — V2 only supports `rotaryMode="half"`).

Full rotary suite: 88 passed, 8 skipped (was 80 passed, 4 skipped before
this test was added).
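
A sketch of what such a cache builder does (the duplicate-the-half neox expansion here is an assumption; the real helper may expand differently for GPT-J style):

```python
import torch

def build_pre_gathered_cache(cos_half, sin_half):
    """Build the `[2*T, head_size]` wire format described above.

    `cos_half` / `sin_half` are `[T, head_size // 2]` per-token tables.
    """
    cos = torch.cat((cos_half, cos_half), dim=-1)   # [T, head_size], neox-expanded
    sin = torch.cat((sin_half, sin_half), dim=-1)
    return torch.cat((cos, sin), dim=0).contiguous()  # rows 0..T-1 cos, T..2T-1 sin
```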

* chore(pr66): drop unused headers

- `src/base/add_rms_norm.h`: `#include <cstddef>` — no `size_t` usage.
- `src/base/rotary_embedding.h`: same.
- `src/ascend/add_rms_norm/kernel_custom.h`: `#include <vector>` — no
  `std::vector` / `std::array` usage.

Build + 355 passed / 8 skipped on Ascend unchanged.

* style(pr66): sweep assert-message periods + comment backticks

Addresses inline review comments on InfiniTensor#66 (reviewer: Ziminli) across all
PR-touched files:

- C4: strip trailing periods from assert messages; lowercase the
  sentence-starting word when it is bare English (e.g. "Ascend ..." →
  "ascend ..."), leave backticked identifiers untouched.
- G4: backtick `RmsNorm` in kernel_custom.h header comment; backtick
  `aclnn` / `cos_sin_cache` / `infini.ops.add_rms_norm(...)` in kernel
  comments that were still running raw text.
- C2: rename `aclrtlaunch_add_rms_norm` / `aclrtlaunch_rms_norm`
  forward-decl parameter names from AscendC internals (`x1, x2, y,
  x_out`) to the base-header semantic names (`input, residual, weight,
  out, residual_out`).  The extern "C" symbol is name-blind so the
  AscendC kernel .cpp can keep its local names — the wrapper .h just
  presents the public contract.
- Pre-gathered rotary test: drop the hardcoded
  `implementation_index=(0, 1)` parametrize, let conftest auto-inject
  and skip impl 2 inline (the impl 2 kernel asserts
  `!pre_gathered_`).

Verified locally (`--gpu-id 3/4/5 --local`):
  test_add_rms_norm.py:     108 passed
  test_rms_norm.py:          72 passed
  test_rotary_embedding.py:  88 passed, 16 skipped
    (expected: impl 2 + pre_gathered, impl 0 + non-neox)

* refactor(pr66): rename AscendC custom kernels to PascalCase + C2 param order

Addresses Ziminli's comment on `aclrtlaunch_add_rms_norm` forward-decl
(InfiniTensor#66 discussion 3115868675 / 3129096521):

- **Function-name format:** the AscendC kernel entry-points `add_rms_norm` /
  `rms_norm` are renamed to `AddRmsNorm` / `RmsNorm`.  The AscendC
  toolchain prepends `aclrtlaunch_` on the symbol regardless of case,
  so the exported names become `aclrtlaunch_AddRmsNorm` /
  `aclrtlaunch_RmsNorm` — matching the base-class names and
  `readability-identifier-naming.FunctionCase = CamelCase`.
  The `NOLINTNEXTLINE(readability-identifier-naming)` shim and the
  "PascalCase rule does not apply" apology comments go away.

- **Parameter-list order (C2):** reorder parameters to `inputs, attributes,
  outputs`.  Both `.cpp` kernel entry, `KernelAddRmsNorm::Init` /
  `KernelRmsNorm::Init`, and the `extern "C"` forward-decl in
  `kernel_custom.h` are updated together, along with the call sites
  in `operator()`.

- **Variable naming (`.cpp` internals):** `x1/x2/y/x_out` →
  `input/residual/out/residual_out`; `x/y` → `input/out`.  Cascaded
  through member names (`*_gm_`, `*_queue_*`, `*_local`) for
  consistency — internal to each kernel class, no ABI impact.

- **`op_host/*.cpp`:** updated to include the PascalCase generated
  header `aclrtlaunch_AddRmsNorm.h` / `aclrtlaunch_RmsNorm.h` and to
  match the reordered `EXEC_KERNEL_CMD` argument list.

Verified locally with `.ci/run.py --local`:
  test_add_rms_norm.py:  108 passed
  test_rms_norm.py:       72 passed

The AscendC toolchain successfully compiles the PascalCase kernel
entries and generates matching launch headers — the
`aclrtlaunch_<ENTRY>` macro concatenates regardless of case.

* refactor(pr66): trim commit-narration comments

/simplify found 4 comment blocks that narrate the rename rationale
rather than encode load-bearing contracts:

- `kernel_custom.h` forward-decl — compress build-system detail
  (`no_workspace_kernel`, `ascendc_library()`) to one line, keep only
  the ABI contract (`aclrtlaunch_<Entry>` is generated by AscendC from
  `op_kernel/`).
- `op_host/<op>.cpp` `EXEC_KERNEL_CMD` — drop "Parameter order follows
  the base class: inputs, attributes, outputs."; the signature itself
  is self-evident.
- `op_kernel/<op>.cpp` kernel entry — drop "Parameters follow the C2
  convention ..." and "`aclrtlaunch_AddRmsNorm` matches the base
  `AddRmsNorm` class name"; these are commit-message material, not
  comments.

---------

Co-authored-by: zhangyue <zhangyue@example.com>
* chore: add pull request template

* chore: escape `#` in PR template to suppress GitHub issue auto-linking

* chore: remove `\#` in PR template
Co-authored-by: zhangyue <zhangyue@qiyuanlab.com>
* refactor: move native hardware platforms under `src/native/`

* refactor: split each platform into scaffolding root and `ops/` subdirectory

* refactor: update `#include` paths after the new layout

* build: update glob patterns in `src/CMakeLists.txt` for new layout

* build: update `scripts/generate_wrappers.py` for the new layout

Operator headers now live at `<platform>/ops/<op>/<file>.h` instead of
`<platform>/<op>/<file>.h`, so the platform name is one level deeper.

* refactor: update `#include` paths in `examples/runtime_api.h`

* style: re-sort `#include` directives after path changes to satisfy `clang-format`

* docs: update operator implementation paths in `CONTRIBUTING.md`

* fix: rename `cuda/runtime.h` to `cuda/runtime_.h` to avoid name collision with framework `runtime.h`

The shared CUDA scaffolding header `cuda/runtime.h` defines the
`CudaRuntime` CRTP base, which inherits from `DeviceRuntime` declared in
the framework-level `runtime.h`. Bringing the framework header in via
`#include "../runtime.h"` violates the Google C++ Style Guide's
prohibition on relative include paths, and rewriting to
`#include "runtime.h"` instead self-references because quoted includes
search the current file's directory first.

Renaming the file to `runtime_.h` matches the trailing-underscore
convention used by every other platform-internal scaffolding header
(`device_.h`, `data_type_.h`, `caster_.h`, etc.) and removes the
collision so `#include "runtime.h"` resolves to the framework header
through the project include path. Update the four sub-platform
`runtime_.h` files that include it accordingly.
gongchensu force-pushed the feat/linear-multi-backend branch from 873d79b to 37e5b3d on May 9, 2026 at 09:10